Zhang et al. BMC Systems Biology 2012, 6(Suppl 3):S9 
http://www.biomedcentral.eom/1752-0509/6/S3/S9 

Systems Biology 



RESEARCH Open Access 



A dynamic time order network for time-series 
gene expression data analysis 

Pengyue Zhang 1+ , Raphael Mourad 1 ' + , Yang Xiang 2 , Kun Huang 2 , Tim Huang 3 , Kenneth Nephew 4 , Yunlong Liu 1 , 
Lang Li 1 

From The International Conference on Intelligent Biology and Medicine (ICIBM) 
Nashville, TN, USA. 22-24 April 2012 




Abstract 

Background: Typical analysis of time-series gene expression data such as clustering or graphical models cannot 
distinguish between early and later drug responsive gene targets in cancer cells. However, these genes would 
represent good candidate biomarkers. 

Results: We propose a new model - the dynamic time order network - to distinguish and connect early and later 
drug responsive gene targets. This network is constructed based on an integrated differential equation. Spline 
regression is applied for an accurate modeling of the time variation of gene expressions. Then a likelihood ratio 
test is implemented to infer the time order of any gene expression pair. One application of the model is the 
discovery of estrogen response biomarkers. For this purpose, we focused on genes whose responses are late when 
the breast cancer cells are treated with estradiol (E2). 

Conclusions: Our approach has been validated by successfully finding time order relations between genes of the 
cell cycle system. More notably, we found late response genes potentially interesting as biomarkers of E2 treatment. 



Background 

Breast cancer represents a major public health issue since 
it comprises 22.9% of all cancers in women and it is an 
important cause of death [1]. Some breast cancers are 
sensitive to hormones such as estrogen (E2) [2]. Thus it 
is possible to treat these cancers by blocking the effects 
of these hormones, using for instance tamoxifen [3]. The 
discovery of biomarkers of the response to drugs is an 
important task in medical research because it helps know 
if a drug is effective for a specific patient and how it is 
metabolized by his organism. Biomarkers play thus an 
important role in personalized medicine, such as in the 
choice of the most relevant treatment. 

Biomarkers often refer to proteins measured in the 
blood whose concentrations reflect the presence or the 
severity of the disease. In the case of estrogen treatment, 
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biomarkers can be seen as parameters reflecting the 
effects of the drug on the patient. The biomarkers of hor- 
mone therapy of the breast cancer is not well developed. 
For instance, although tamoxifen's pharmacology 
mechanism is well known, its clinical biomarker is not 
well established yet. Understanding the cascade of estro- 
gen signaling pathway is the key to study the potential 
biomarkers. 

Gene expression-based biomarker discovery has 
demonstrated efficiency for breast cancer [4,5] . Standard 
methods rely on computing correlations between gene 
expressions and drug treatment status. Simple statistical 
procedures are used such as t-tests to assess the signifi- 
cance of over- or under-expressions of genes before and 
after treatment in steady-state analysis [6]. Clustering has 
also been successfully used for revealing particular pat- 
terns of expression [7] . 

Unfortunately standard methods might fail to reveal 
key biomarkers, since they do not take into account the 
temporal aspect of gene expression and the complex net- 
work of gene regulation. To tackle this issue, the analysis 
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of time series data through dynamic networks represents 
efficient alternatives [8]. In this context, three main 
approaches can be distinguished: dynamic Bayesian net- 
works, information-theoretic networks and ordinary dif- 
ferential equations. Dynamic Bayesian networks (DBNs) 
have been successfully applied to infer causal gene net- 
works [9,10]. Conditional independences encoded in 
DBNs guarantee to infer direct relations between genes. 
The second approach consists in inferring the structure 
of dependences through an information-theoretic frame- 
work [11,12]. Most notably, the data processing inequal- 
ity principle helps discard the majority of indirect 
dependences without involving time consuming algo- 
rithms such as those for DBNs. The last method relies on 
ordinary differential equations (ODEs) [13,14]. In this 
method, changes of gene expression are related to each 
other through a system of differential equations. Most 
notably, this method accurately and explicitly models the 
continuous time aspect of gene expression. Recently a 
combination of ODEs and DBNs has been proposed for 
taking into account both causal discovery (DBNs) and 
accurate modeling (ODEs) of gene expression [15]. 

Late response genes might represent relevant biomar- 
kers because they are more stable over the time. Our 
approach relies on this biological aspect of biomarker 
discovery. To identify late response genes, we propose a 
new model based on a dynamic time order network 
(DTON). The model interpretation is simple and intui- 
tive: it reflects which genes express in the early times and 
which ones in the late times after the hormone treat- 
ment. The DTON is constructed based on an integrated 
differential equation. Spline regression is applied for an 
accurate modeling of the time variation of gene expres- 
sions. A likelihood ratio test is implemented to infer the 
time order of any gene expression pair. The advantages 
of this modeling approach are numerous: (i) closed-form 
expressions of ODEs, (ii) accurate modeling of the time 
series data by using spline regression and by integrating 
differential equations, and (iii) model learning involving 
simple regressions quick to compute and only a few para- 
meters have to be estimated. The method has been vali- 
dated by successfully finding time order relations 
between genes of the cell cycle system. Most importantly, 
we found late response genes as candidate biomarkers of 
E2 treatment. 

This paper is organized as follows. Section Materials and 
methods first describes experiments and data preproces- 
sing. Late response genes are defined and discussed. Then 
the dynamic time order network and its model learning 
are presented. It is described how dynamic time order 
relations between genes are inferred through a likelihood 
ratio test. The next section illustrates our method on real 
data analysis. Our model is validated with the well-known 



cell cycle system. Late response genes of E2 treatment are 
discovered. Finally, the last section concludes and points 
out promising perspectives. 

Materials and methods 

Experiment and data preprocessing 

The gene expression data come from estrogen stimu- 
lated ZR_75_1 cells. G 0 - Gi synchronization cells were 
treated with 10~ 8 M of 17 ft - estradiol (E2). Then RNA 
was extracted from the cells before (0) or after 1, 2, 4, 6, 
8, 12, 16, 20, 24, 28 and 32 hours of stimulation. For 
more details, the reader is referred to the original study 
[16]. There are 48702 probes in the original study and 
some of them are duplicated. Duplicated probes are 
averaged. Then only highly differentially expressed genes 
are considered through the following method. Standard 
deviation and mean were computed for each mRNA. A 
gene is considered as not differentially expressed if its 
standard deviation over its mean is small. At this point, 
we chose 0.15 as threshold. Finally, we only kept 5003 
genes with high variation of their expression. The loga- 
rithmic concentration ratio (LCR) at every time point is 
used. Let C t denotes the concentration at time point t 
for a gene, then the LCR at time point t would be log 

The LCR indicates how much the concentration 
increases or decreases from the concentration at the 
first time point. In order to unify the variance for differ- 
ent genes, we standardized the LCRs at each time point. 

Late response gene 

In breast cancer cells, Cicatiello et al. [16] showed that 
all the major time dependent gene expression profile 
clusters follow two major patterns: (i) go up or down, 
then stay flat; and (ii) go up or down first, stay flat, then 
go down or up, respectively. These patterns can be cap- 
tured by a natural cubic spline function divided in three 
parts using two knots. The early response genes are 
thus defined as either up- or down-regulated genes 
before 5.333 hours, following E2 stimulation. The late 
response genes are defined as either up- or down-regu- 
lated genes after 17.333 hours. The time points 5.333 
hours and 17.333 hours represent the 33th and 67th 
percentiles of the sampling time points. 

Biologically, we favor late response genes because of 
their clinical implications. To check whether a drug 
works in human, i.e. inhibiting or simulating the target, 
one or multiple reliable biomarkers are useful to indi- 
cate the drug effects. An early response gene may not 
be predictive for the long term effect of the drug. It is 
always desirable to use a biomarker that can predict a 
sustainable effect of the drug. Therefore, a late response 
gene represents a better biomarker than an early 



Zhang et al. BMC Systems Biology 2012, 6(Suppl 3):S9 
http://www.biomedcentral.eom/1752-0509/6/S3/S9 



Page 3 of 12 



response one. In our dataset, responsive genes after 
17.333 hours following E2 treatment are likely to be the 
best biomarkers. 

The dynamic time order relationship 

Let fi(t) and f 2 {t) represent the LCR curves of two genes 
Gx and G 2 over the time t, as depicted in Figure la. 
Suppose Gi and G 2 have a dynamic time order relation 
such that the expression of G 2 is later than the one of 
G\. This relation is denoted as Gi — > G 2 . Then the 
changing rate of G 2 should be related to the LCR of Gi 
and itself [8]. The model is an ODE: 



d/ 2 (Q 
dt 



fei/iW +kih{t). 



(1) 



In Equation (1), represents the changing rate of 
G 2 expression. Alternatively, Equation (1) can be 
expressed by integration: 



/ 2 (t) = ki F 1 {t) + k 2 F 2 {t). 



(2) 



In Equation (2), Fi(t) and F 2 (t) represent the cumula- 
tive expression of Gi and G 2 . The integration of the 
ODE can help to better distinguish which gene is firstly 
expressed in a non-trivial scenario, such as the one pre- 
sented in Figure lb. In this example, we can see that it 
is possible to infer the dynamic time order relation 
between Gi and G 2 only during the early time (because 
only in the early time we observe a significant difference 
between the two rates). By integrating the ODE (see 
Equation (1)), the model can take into account all the 
variation of the gene LCR (in early and late times). Note 
that this dynamic time order relation does not imply 
any causal relation between two genes but only indicates 
which one is expressed after the other. 

Natural cubic spline regression 

In order to apply the integrated ODE model (Equation 2), 
a smooth curve is required to fit gene expression over the 



time. For this purpose, natural cubic spline regression 
(NCSR) [17] represents a good choice, since it provides a 
good trade-off between fit to data and model complexity. 
NCSR is a third-order polynomial function: 



/i(t) =/3o + 0i t + p 2 t 2 + fo t 3 



(3) 



with fi(t) the LCR of gene G,. Observations j, for a 
gene G, are regressed by the NCSR function: 



y t = 0 O + 01 t + 02 t 2 + 03 £ 3 + E SRi , 



(4) 



with ssm ~ A/"(0, er s 2 Ri ) • ssm and aj Ri respectively 
denote the residuals of the spline regression and their 
variance associated with the gene G,. 

The time interval of our gene expression data is t e 
[0; 32] hours. We divide the function f t into three parts 
using two knots at 5.333 hours and 17.333 hours. The 
decomposition of the cubic function using knots is pre- 
sented in Additional file 1. 

Let = (J3 iJ0 , p ifl , /3 iJ2 , /3 iJ3 ) T and t = (1, t, t 2 , t% then 
Yi ~ AA(t0y, ag Ri ) at time t. The distribution of yt is 
written as: 



P{y.) 



: exp 



lira. 



SRi 



2°SRi 



(5) 



with j=l + J2f, =1 l(f > K),) ■ The value of j refers to 
the first, second or third interval. 

In our study, we have 12 different time points t e {0, 
1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32} and their associate 
LCRs for the gene G ; are the vector y, = (y i0 , ya 2 ). 
Based on Equation 5, the likelihood for the NCSR 
model of gene G, for the set of 12 independently and 
identically distributed (i.i.d.) samples T> is: 



/ (y„ -t/9„) 2 \ 



(6) 




Figure 1 Logarithmic concentration ratios of two genes G, and G 2 a) A scenario showing a trivial order between the two genes, b) A more 
complex scenario showing an non-trivial order. 
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The parameters are learned by maximizing the likeli- 
hood in Equation 6 with constrains (see Additional file 1). 
There are 12 parameters in the cubic function. However, 
only 4 out of the 12 parameters are free, as constrains 
must be satisfied. If we set fij 2 = {Pao, Pm, Pin, PmY as 
the free parameters, we can solve parameters /J,i and /} l3 
(see Additional File 2). 

We can simplify the joint likelihood in Equation 6 as 
follows: 



m>«, fa, fa, <4.ri>) - P»<4,) 2 x iwk, «p 



where t* can be solved by the following way: 



(7) 



3fC 2 r 3 - 288K 2 £ 2 + 9K^t - 3Ki 



(0 < t < K,) 
(Ki < t < K 2 ) 



(8) 



The maximum likelihood estimator of /J ;2 for gene G, 
can be computed through a multiple linear regression: 



(9) 



with T* a 12-by-4 matrix (presented in Additional 
file 3). In the matrix T*, each row k corresponds to 
the vector t s! at the time point % of the vector 
T T = (0, 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32) T . As 
previously mentioned, using the parameters j$ j2 , we 
can estimate the parameters p n and j$ i3 (see Addi- 
tional file 2). Then with all these parameters, we can 
obtain a smooth curve to represent f L {t) for gene G, in 
the whole time interval 0 to 32 hours, as Figure 2 
shows for the gene APLP2. Therefore the ODE in 
Equations 1 and 2 can be applied. 

Time order determination 

Based on Equation 2, the dynamic time order relation- 
ship between two genes can be learned using the follow- 
ing multiple linear regression: 



y it = b i0 + bnFi (t) + b i2 F 2 {t) + e M m, 



(10) 



with s MRi ~ Af(0, <y^ Ri ) ■ s MRi and a^ Ri respectively 
denote the residuals of the multiple regression and their 
variance associated with gene G,. The response variable 
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Figure 2 Gene expression of APLP2 fitted by natural cubic spline regression with 2 knots K, and K 2 . K, = 5.333 hours and K 2 = 17.333 
hours. 
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y u is the LCR of gene G; at time t and the predictor 
variables are integrations of the cubic functions at time 
t. For a predictor variable, the integration F; of piecewise 
cubic functions fa, f a andja is calculated as follows: 



m) = fof*(t), (o<t<K 1 ) 

m) = Cx + P Ki Mt), {K 1 <t<K 2 ) 
F i (t) = C 2 + /4/, 3 (0, {K 2 <t<32), 



(11) 



where C 



i= / /n(f) and C 2 = fa{t) + Ci are 
Jo JK, 

constant terms. They vary for different gene LCRs. 

We apply the model in Equation 10 to every pair of 

genes to determine whether there is a dynamic time 

order relation between them. The pairwise regression 

models for two genes G l and G 2 are: 

y 1 = Xbi + s MR1 (12) 

y 2 = Xb 2 + s MR2 , (13) 

with s M Ri ~ Af{0, a* Ri ) . Vectors y[ = (y 10 , ym) T 
and y\ = (y 2 o, ■ ■ ■ , y 2 3 2 ) T are the LCRs for the pair of 
genes G 1 and G 2 , and bj = (fyo, ba, bi 2 ) T are the associ- 
ate parameters in the model presented in Equation 10. Let 
Fi(t) be the integration of ft{t) getting from Equation 11 
and Fj = (F,(0), . . . , F,(32)) T be the function values for 
Fi(t) at each time point t. Then the predictor variable is 
X = (1, Fj, Fj). 

Thus in Equations 12 and 13, values of y,- (left hand 
side) come from data and values of X (right hand side) 
result from the integration of the NCSR functions. For 



the pair of genes Gi and G 2 , the model in equation 12 
represents the dynamic time order relation G 2 — » Gi 
and the model in equation 13 represents the dynamic 
time order relation Gi — > G 2 . 

Pairwise regressions are then computed for all pairs of 
genes and the log-likelihoods are calculated (see Addi- 
tional file 4). In order to find whether a pair of genes 
has a dynamic time order relation, we look at their log- 
likelihood difference. If two genes present a dynamic 
time order relation, the regression relying on the true 
relation will have a better log-likelihood value than the 
regression based on the wrong relation, as Equations 12 
and 13 represent two different dynamic time orders. 

Network construction 

After determining the time order relationships, an «-by- 
n adjacency matrix (n is the number of genes) is con- 
structed whose weights are the previously computed 
log-likelihood differences. In the matrix, for a couple of 
genes, only the positive log-likelihood difference value is 
kept and the negative (symmetric) log-likelihood differ- 
ence value is set to 0. This adjacency matrix represents 
the complete directed graph of time order relationships. 
Small network 

When the network is small (less than one hundred 
nodes), it is interesting to keep as much as possible 
information about time order relations. The best strat- 
egy in this case is fine tune a threshold used to remove 
non-significant edges. For this purpose, a simple and 
efficient approach is the use of the median or other 
quantiles of the distribution of log-likelihood difference 
values. Then a simplification step is used to remove 



Time of 
expression 



•••• 




Late response genes 
(candidate biomarkers) 



Figure 3 Dynamic time order network and its biological interpretation. The blue nodes indicate the late response genes whereas the red 
nodes point out the remaining genes. 
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redundant edges. For instance, when one observes A — > 
B and B — > C, then A — > C is considered as redundant 
and is removed. For graph drawing, the Sugiyama's algo- 
rithm [18] provides a hierarchical display which is parti- 
cularly relevant for reflecting time order relations. 
Genome-wide network 

When the network is huge, such as the genome-wide 
network from the microarray data, the previous 
approach cannot be used. The reason is that a low 
threshold value will create a network highly connected 



which is too complex to manipulate and to visualize, 
whereas a high threshold value will lead to a graph with 
many connected components from which it will only be 
possible to infer time orders between connected genes. 
To tackle this issue, we compute the so-called maxi- 
mum weight spanning tree (MWST). This graph pre- 
sents several advantages: (i) its tree shape is a very 
simple structure easy to manipulate and visualize, and 
(ii) every node is connected by a path such that we can 
access to the time order relation between each gene. 




Zoom 

Figure 5 Genome-wide dynamic time order network. It has been computed for all the 5003 genes. Big circles represent incoming-edge hubs 
at the center (blue) connected to a very large number of nodes (red). The color code is the same as in Figure 3. 
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Besides, the MWST can be quickly computed in O 
(n 2 logn) through the Prim's algorithm [19]. Regarding 
graph drawing, the Sugiyama's algorithm cannot be used 
when the graph is too huge. Instead we prefer to display 
it using an algorithm specific to tree drawing, the Bub- 
ble tree algorithm [20]. 

Biological interpretation of the model 

The dynamic time order network (DTON) has a biologi- 
cal interpretation. It is illustrated in Figure 3. In this 
network, late response genes are hubs which are con- 
nected by many incoming pathways. Thus the identifica- 
tion of these hubs helps find candidates for biomarkers 
of breast cancers. Based on this idea, we propose a cri- 
terion to identify late response genes in the network. 
Late response genes are defined as nodes only con- 
nected to incoming edges, and the more incoming edges 
a node has the later is considered its response. We call 
these nodes "incoming-edge hubs". 

Implementation 

Our learning method is implemented in R. The R source 
code is available on request. For graph drawing and dis- 
play, the software Tulip (http://tulip.labri.fr/TulipDrupal/) 
was used. It is a user-friendly tool able to deal with about 
one million nodes. 

Results and discussion 

Reproducing the cell cycle temporal system 

The cell cycle temporal system represents a good bench- 
mark for evaluating our method. In this subsection, in 
order to see if we can reproduce the time order relations, 
we focused on key cell cycle genes. Twelve mRNA 
expression data were selected, which include cyclin Al 
(CCNA1), cyclin A2 (CCNA2), cyclin Bl (CCNB1), cyclin 
B2 (CCNB2), cyclin Dl (CCND1), cyclin D3 (CCND3), 
cyclin El (CCNE1), cyclin E2 (CCNE2), cyclin-dependent 
kinase 1 (CDK1), cyclin-dependent kinase 2 (CDK2), 
cyclin-dependent kinase 4 (CDK4) and cyclin-dependent 
kinase 6 (CDK6). Regressions have be computed for all 
pairs of genes. Then, the network of cell cycle genes has 
been computed by thresholding using the median of the 
log-likelihood differences. After simplification, the 
inferred network is composed of 27 time order relations. 
It is depicted in Figure 4a. The reference network of cell 
cycle genes is displayed in Figure 4b. Over the 27 time 
order relations inferred, 24 correspond to the reference 
network, 0 are wrong and 3 cannot be checked from the 
reference network (because the reference network is not 
enough accurate). The network is thus recovered with at 
least 89% of accuracy. More notably, the network points 
out 5 incoming-edge hubs: CDK2, CCNB2, CDK1, 
CCNB1 and CCNA2 (these nodes are colored in blue in 
Figure 4a). The genes CCNB2, CDK1, CCNB1 and 



CCNA2 correspond to late response genes in the refer- 
ence network. Regarding CDK2, it should be considered 
as an intermediate response gene. Compared to the other 
hubs which all show 5 incoming edges, CDK2 only pre- 
sents 3 incoming edges. 

Genome-wide network 

For genome-wide network modeling, an MWST has been 
constructed from all pairwise regressions on the 5003 
genes. The network is depicted in Figure 5. We observe 
that this network is composed of several large incoming- 
edge hubs and reflects a star shape topology. The 10 most 
important hubs are listed in Table 1. We observe that 
CELCAM6 is connected to 2783 incoming edges (CEL- 
CAM6 is magnified in the Figure 5). Other large hubs are 
EPAS1, CALB2, UPK1A, KRT81, PDZK1, MT2A, 
FANCD2, C20orfl60 and WDR51A, in the decreasing 
order of importance. The profiles of expression over time 
are presented in Figure 6. All these profiles reflect a late 
under- or overexpressed response. CELCAM6, EPAS1, 
UPK1A and KRT81 are genes whose expressions decrease 
over the time, whereas CALB2, PDZK1, MT2A, FANCD2, 
C20orfl60 and WDR51A are overexpressed after E2 
treatment. 

The identification of late response genes does not repre- 
sent a well-studied issue. Most notably, no dedicated 
method has been developed for this purpose. Nevertheless, 
we tried to compare our method with standard approaches 
in gene expression analysis: agglomerative hierarchical 
clustering (AHC) and t-tests. On the one hand, AHC is a 
well-used tool to cluster gene expression profiles. After 
computing AHC, we used the silhouette criteria to deter- 
mine the optimum number k of clusters [21]. We obtain 
the best silhouette values for k = 5. However, when we 
looked at the clusters, we were unable to identify any clus- 
ter corresponding to late response genes. We thus tried 
with higher values of k. With k = 20, we are able to more 
accurately distinguish different trends in gene expression 



Table 1 List of the 10 most important incoming-edge 
hubs. 



Gene 


Number of incoming edges 


Expression 


CEACAM6 


2783 


Underexpressed 


EPAS1 


417 


Underexpressed 


CALB2 


250 


Overexpressed 


UPK1A 


171 


Underexpressed 


KRT81 


150 


Underexpressed 


PDZK1 


130 


Overexpressed 


MT2A 


102 


Overexpressed 


FANCD2 


78 


Overexpressed 


C20orf 1 60 


62 


Overexpressed 


WDR51A 


49 


Overexpressed 
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Figure 6 Gene expression profiles for the 10 most important incoming-edge hubs LCR: logarithmic concentration ratio 



(see Figure 7). However, it is still hard to identify late 
response genes. The clusters 2 and 9 might represent can- 
didates for over-expressed and under-expressed late 
response genes, respectively. On the other hand, we used a 



t-test strategy. We obtained better results. Our strategy 
was the following: (i) first we selected genes whose devia- 
tions of absolute LCR values from 0 for the first time 
points 0, 1, 2, 4, 6, 8, 12 and 16 are non-significant (p - 
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Figure 7 Cluster profiles obtained with agglomerative hierarchical clustering, for all the 5003 genes 
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value >quantile{0.6)), and (ii) then, from these selected 
genes, we only kept those whose absolute LCR values for 
the last time points 20, 24, 28 and 32 are significantly 
different from 0 (p - value <quantile(0. 05)). Profiles of 
selected genes are depicted in Figure 8. With t-tests we 
observe a better identification of late response genes than 
with AHC. However we notice that some of these gene 
expressions oscillate between over- and under-expression 
for the last time points. Figure 9 shows the Venn diagrams 
for the comparison of results between DTON (our 



method), the AHC and the t-test strategy. The t-test strat- 
egy and DTON are both very specific with a few number 
of genes identified: 91 and 27 over-expressed genes, and 
13 and 8 under-expressed genes for t-tests and DTON, 
respectively. For over-expressed genes, more than half of 
the genes found with DTON are also identified with t- 
tests. Regarding under-expressed genes, few genes are 
shared. Comparatively, AHC is much less specific with 
around 2600 over-expressed and around 200 under- 
expressed genes. It is thus not surprising that AHC shares 
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Figure 9 Venn diagram to compare the late response genes identified from the dynamic time order network (DTON), the 
agglomerative hierarchical clustering (AHC) and the t-tests. 
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a large proportion of over-expressed genes with DTON 
and t-tests. 

We also search in the literature if the late response 
genes identified with our method can be good candidate 
biomarkers. Since biomarkers are molecules that are 
observed in cancer patients but not in healthly people, 
there are likely to be genes overexpressed after E2 treat- 
ment. Among the overexpressed hubs of the network, 
CALB2, PDZK1, MT2A and FANCD2 are well-known 
in the literature as diagnostic marker of breast cancer 
and E2 response [22-25]. Besides, C20orfl60 is reported 
in the Genes-to-Systems Breast Cancer Database (http:// 
www.itb.cnr. it/breastcancer//index.html). WDR51A (also 
called POC1A) is found associated with breast cancer in 
The Human Protein Atlas (http://www.proteinatlas.org/). 

Conclusion 

Based on experimentations carried out on time-series 
gene expression data, our dynamic time order network 
has been shown to efficiently distinguish and connect 
early and late response genes. First, our model has faith- 
fully reproduced the cell cycle temporal system. Over 
the 27 time order relations inferred, 89% correspond to 
the state-of-art network, 11% cannot be checked, but no 
one are false. Second, our approach has been success- 
fully applied to a genome-wide level. The learning 
method has been able to process five thousands genes 
and the network simplification through the maximum 
weighted spanning tree provided a graphical display of 
the huge network. Most notably, several incoming-edge 
hubs showing very high connectivity have been discov- 
ered. All these hubs showed late gene response profiles. 
Regarding those which are overexpressed over the time, 
they have been reported as biomarkers of breast cancer 
and E2 response in the literature and databases. 

The comparison of results with other approaches is 
not straightforward, since our method is the only one 
dedicated to identify late response genes. When com- 
pared with standard methods in gene expression analy- 
sis, our approach yielded specific results, contrary to 
agglomerative hierarchical clustering. Moreover it does 
not need any complex thresholding such as with a t-test 
strategy. It is worth noting that all genes identified with 
DTON showed late responses, while this is not the case 
with the t-test strategy. Besides, our approach is based 
on the comparison of gene expression integrals com- 
bined with cubic spline regression, thus offering an 
accurate assessment of time order relations. 

The discovery of biomarkers is one of the application 
of our model. The distinction between early and late 
response genes is also an important application in devel- 
opmental biology where the understanding of the tem- 
poral aspect of gene expression is a key issue such as 
for cell differentiation. For the moment, we mainly 



focused on the identification of late response genes. The 
use of another graph modeling would be more efficient 
for pointing out early response genes than the MWST 
which tends to display incoming-edge hubs. 

Additional material 
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